Closed Bug 1598342 Opened 6 years ago Closed 5 years ago

TSan: data race nsprpub/pr/src/malloc/prmem.c:476:9 in PR_Free

Categories

(NSPR :: NSPR, defect, P5)

x86_64
Linux
defect

Tracking

(firefox75 fixed)

RESOLVED FIXED
Tracking Status
firefox75 --- fixed

People

(Reporter: decoder, Assigned: decoder)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

I'm seeing the attached race all over the place when trying to run Mochitests with TSan; it is by far the most frequent failure. I'm not sure whether this can cause any actual problems, such as a hang during thread shutdown.

If I understand correctly (someone should double check this):

  1. There's a thread T54 that is finished with its work and is cleaning up here
  2. The main thread is waiting in pthread_join for T54 to exit
  3. T54 exits and the main thread proceeds
  4. Main thread free()s the control data for T54

And TSan is complaining about a race between steps 1 and 4? Maybe it doesn't know about the join relationship in between? Is there a way to annotate this?

(This assumes nothing super weird is going on, like a mixup of thread IDs that makes us free() the wrong thing.)
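To make the sequence above concrete, here is a minimal, self-contained sketch (hypothetical code, not taken from NSPR) of the pattern in steps 1-4. TSan intercepts pthread_create/pthread_join, so the join should establish a happens-before edge between the worker's last access and the later free(), and this pattern should normally not be reported:

  /* Minimal sketch of steps 1-4 above -- hypothetical, not NSPR code.
     Build with: clang -fsanitize=thread -g example.c -lpthread */
  #include <pthread.h>
  #include <stdlib.h>

  struct control { int state; };

  static void *worker(void *arg) {
      struct control *c = arg;
      c->state = 1;               /* step 1: the thread touches its control data */
      return NULL;                /* step 3: the thread exits */
  }

  int main(void) {
      struct control *c = malloc(sizeof(*c));
      pthread_t t;
      pthread_create(&t, NULL, worker, c);
      pthread_join(t, NULL);      /* step 2: main waits for the thread to exit */
      free(c);                    /* step 4: main frees the control data -- no race,
                                     because the join orders it after step 1 */
      return 0;
  }

If TSan were to flag this pattern anyway, something in its view of the create/join relationship would have to be off.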

Blocks: tsan

(In reply to :dmajor from comment #2)

Maybe it doesn't know about the join relationship in between? Is there a way to annotate this?

TSan knows the pthread primitives; it should not be necessary to annotate anything unless we use some kind of lock/sync mechanism that is not built in (which we don't, AFAIK).

To me it seems that if there is a race between steps 1 and 4, it could well be a real problem. If the thread is somehow still reading its private data while the main thread frees it, I don't see how that would be expected. For now I will suppress this, but it does not sound like expected behavior to me (though I am not an expert on our NSPR thread implementation).
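For reference, the interim suppression is just an entry in a TSan suppressions file passed at runtime via TSAN_OPTIONS=suppressions=<file>; the frame name below is only an illustration of what such an entry could look like, not necessarily the exact one we use:

  # Illustrative TSan suppression entry -- matches any reported race whose
  # stack contains the named frame (the frame chosen here is an assumption).
  race:PR_Free

The downside, mentioned later in this bug, is that matching such a frequent report against the suppression list over and over is itself costly.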

To me it seems that if there is a race between steps 1 and 4, it could well be a real problem. If the thread is somehow still reading its private data while the main thread frees it, I don't see how that would be expected.

Agreed for sure. But does such a race exist in this particular case? It seems like the main thread is known not to reach free() until after the other thread finishes step 1. But this is my first time looking at this code so I could be misunderstanding it.

(In reply to :dmajor from comment #4)

Agreed for sure. But does such a race exist in this particular case? It seems like the main thread is known not to reach free() until after the other thread finishes step 1. But this is my first time looking at this code so I could be misunderstanding it.

If TSan reports such a race, then from my perspective there are only two options:

  1. There really is such a race.

or

  2. There is a locking mechanism, invisible to TSan, that synchronizes this.

So if the main thread is "known not to reach free() until after the other thread finishes step 1", I would naively ask "how?". If a standard synchronization primitive is doing this, then TSan would know about it. If the synchronization is a custom mechanism we implemented ourselves (e.g. using inline assembly), or it happens through atomics/primitives in an uninstrumented part of the code, then that would explain the report and it might be a false positive.
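For completeness, if we did have such an uninstrumented or hand-rolled synchronization mechanism, the usual way to teach TSan about it would be the annotation interface in <sanitizer/tsan_interface.h>. A minimal sketch, where custom_signal()/custom_wait() are hypothetical stand-ins for a mechanism TSan cannot see:

  /* Hypothetical sketch: custom_signal()/custom_wait() stand in for a
     synchronization mechanism TSan cannot see (e.g. inline assembly or
     code in an uninstrumented library). The __tsan_* calls only resolve
     when linking the TSan runtime (-fsanitize=thread). */
  #include <sanitizer/tsan_interface.h>

  extern void custom_signal(void);   /* hypothetical: wakes the waiter */
  extern void custom_wait(void);     /* hypothetical: blocks until signaled */

  static int shared_data;

  void producer(void) {
      shared_data = 42;
      __tsan_release(&shared_data);  /* everything before this point ... */
      custom_signal();
  }

  void consumer(void) {
      custom_wait();
      __tsan_acquire(&shared_data);  /* ... happens-before everything after this */
      int observed = shared_data;    /* TSan now considers this access ordered */
      (void)observed;
  }

That said, as noted above, none of this should be necessary for plain pthread_join.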

So if the main thread is "known not to reach free() until after the other thread finishes step 1", I would naively ask "how?". If a standard synchronization primitive is doing this, then TSan would know about it.

This is the mechanism:

  2. The main thread is waiting in pthread_join for T54 to exit

The main thread can't reach free() until it returns from pthread_join, which can't happen until T54 exits, which can't happen until _PR_DestroyThreadPrivate returns to _pt_root.

I don't know if that counts as a synchronization primitive.

This is a false positive from TSan.

The report says there's a read when we're destroying NSPR's thread-specific data on a particular thread, just before we exit the NSPR top-level thread routine. It then complains that we're writing to that location while joining the thread.

The sequence of events looks something like:

Main thread (MT): requests join of thread T1 through PR_JoinThread.
MT: waits somewhere in libpthread or similar.
T1: Finishes internal nsThread routine.
T1: returns to https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#201
T1: bookkeeping to clean up: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#203-235
T1: destroys internal NSPR thread stuff: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#248

We are now at the point where TSan records the initial read of the memory location.

T1: finishes destroying, exits _pt_root, returns back to whatever libpthread bits started the thread in the first place.
T1: signals any waiters that the thread has exited.

At this point, T1 is finished. No further memory activity can come from T1.

MT: gets woken up by said signal from the previous step.
MT: resumes: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#587
MT: calls _pt_thread_death_internal: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#590-594

Note that we're not running the thread-local data destructor(s) here, as per the comment in the code: that comment references the point where TSan found the initial memory write.

MT: we free the underlying thread structure: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c#912
MT: we call free(3) on the pointer: https://searchfox.org/mozilla-central/source/nsprpub/pr/src/malloc/prmem.c#476
MT: we're writing to the pointer (?) in TSan's hooks (?) Or is this just our scribbling over freed memory in mozjemalloc or something?

So TSan is complaining that we're writing to memory that T1 read from, and that write could have been important to T1, so we should have synchronized. But we did synchronize: the main thread can't get to the point where it would be writing until T1 told it to proceed. TSan even intercepts pthread_join and pthread_create, so it should have some inkling of what's going on here.

The priority flag is not set for this bug.
:KaiE, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kaie)

(In reply to Nathan Froyd [:froydnj] from comment #7)

This is a false positive from TSan.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(kaie)
Priority: -- → P2
Resolution: --- → INVALID

(In reply to Kai Engert (:KaiE:) from comment #9)

(In reply to Nathan Froyd [:froydnj] from comment #7)

This is a false positive from TSan.

It would be good to leave this open because we are still trying to mitigate this somehow.

Our current status is that this is due to how NSPR is built; it can perhaps be fixed in TSan, but if not, we might have to apply a custom fix in NSPR. It is not clear yet how we can easily resolve this, but suppressing this race is very costly because it is so frequent.

I'm currently planning to test with a patched Clang to see if dvyukov's suggested change in TSan fixes the problem.

Assignee: nobody → choller
Status: RESOLVED → REOPENED
Priority: P2 → P5
Resolution: INVALID → ---

Just a quick status update: I am still in touch with the TSan developers to figure out if there is anything we can do to make this easily work.

If that doesn't work out, I just had another idea:

We would try to compile-time blacklist _pt_root and _PR_DestroyThreadPrivate by annotating them in the code, as sketched below. When building with TSan, we would also have to replace the memset call in _PR_DestroyThreadPrivate with an inlined version, because the libc memset is otherwise intercepted by TSan.
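A sketch of what the source annotation could look like, assuming a Clang-style no_sanitize attribute wrapped in a hypothetical macro (the macro name and its wiring are assumptions, not an existing NSPR facility):

  /* Hypothetical macro (PR_NO_SANITIZE_THREAD is not an existing NSPR name)
     that would be applied to the definitions of _pt_root and
     _PR_DestroyThreadPrivate so TSan does not instrument their accesses. */
  #if defined(__has_feature)
  #  if __has_feature(thread_sanitizer)
  #    define PR_NO_SANITIZE_THREAD __attribute__((no_sanitize("thread")))
  #  endif
  #endif
  #ifndef PR_NO_SANITIZE_THREAD
  #  define PR_NO_SANITIZE_THREAD
  #endif

An alternative would be a Clang special-case list with fun:_pt_root and fun:_PR_DestroyThreadPrivate entries passed via -fsanitize-blacklist=, which would avoid touching the NSPR sources.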

The problem has been diagnosed by the TSan developers and a fix is available at https://reviews.llvm.org/D74828

In short, the problem was caused by our use of the called_from_lib suppression feature. Apparently, this feature caused TSan to disable some vital interceptors for pthread functions, which desynchronized TSan's internal thread state.
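For context, such a suppression is a single line in the suppressions file; the library name below is an assumption, shown only to illustrate the mechanism:

  # Illustrative called_from_lib entry (library name assumed). Matching a
  # library here tells TSan to ignore interceptor calls coming from it,
  # which before the D74828 fix also covered the pthread interceptors that
  # maintain TSan's happens-before state, as described above.
  called_from_lib:libnspr4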

I will try to backport the patch to our version of Clang 9 so we can close this bug. Getting rid of these races should give us a nice performance boost for TSan, because the suppression stats show that this issue is by far the most frequent one.

Pushed by choller@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/99f74aa3cae9 Import TSan fix D74828 from Clang upstream. r=froydnj
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED